- 7. BACKGRND: concepts, program design, compressor design, benchmarks
- ====================================================================
-
- Although many aspects of UC look simple and straightforward, UC is a
- very advanced program.
-
- This document contains the following paragraphs:
-
- - A. Inside UC
- - B. Compression technology
- - C. Damage protection technology
- - D. Benchmarking
-
- To jump to a certain paragraph, press the corresponding letter. To
- jump to a specific document/chapter, press the corresponding digit.
- See chapter 1 paragraph E for an overview of all documentation.
-
-
- 7.A INSIDE UC.
- ==============
-
- This paragraph gives a rough overview of the internals of UC. It can
- give those who are interested more insight into what is going on
- INSIDE UltraCompressor II (tm).
-
- If you want more detailed technical information (e.g. for making add-on
- tools) you can always contact us. Please note however we do not intend
- to supply the source code of UC to third parties. Sample add-on tools
- (including their source code) are available on request.
-
- UC is programmed in C++ and assembly language (for the time critical
- sections). It is devised in a number of modules which are briefly
- described in this section.
-
-
- Overview
- --------
- GENERAL CONTROL
- main
- command interpreter
- super manager
- archive I/O
-
-
-   SYSTEM SERVICES
-     video
-     lowlevel I/O
-     memory manager
-     virtual memory manager
-
-   COMPRESSION SERVICES
-     neuro manager
-     ultra engine
-
-   RELIABILITY SERVICES
-     error handler
-     damage protection
-     guard
-     test
-
-   OTHER
-     lister
-     viewer
-
- General control
- ---------------
- MAIN
- The main module calls all initialization routines and takes care
- of the configuration and help menus.
-
- COMMAND INTERPRETER
- The command line and script files are interpreted in this
- module.
-
- SUPER MANAGER
- Finds out what should happen (e.g. ask the command interpreter,
- scan the disk and the archive), and makes it happen (call the
- datacompressor etc.). This module is the most complex module of
- UC, but it delegates a lot of hard work to other modules.
-
- ARCHIVE I/O
- Here the archive file format is managed. Conversion between
- external (compressed flat file) and internal (object database)
- formats, as well as all kinds of integrity checks are performed.
-
- System services
- ---------------
- VIDEO
- The keyboard, screen and output file (in case of redirection)
- are managed here.
-
- LOWLEVEL I/O
-        This module handles lowlevel file specifics and networking. It
-        also takes care of file caching, which is essential to making
-        UC perform fast on (floppy) disk and network access.
-
- MEMORY MANAGER (MM)
- This module manages base memory, XMS and EMS. The MM decides
- which component of the program can get memory, and how much
- memory it can get, to optimize overall performance.
-
- VIRTUAL MEMORY MANAGER (VMM)
- This module manages the 'persistent object database' containing
- the central directory of an archive and numerous links needed to
- make processing as efficient as possible. No matter the amount
- of data in the VMM, it can always work (fast) with only a small
- amount of memory. VMM uses the memory manager to optimize its
- performance wherever possible. The VMM also supplies pipeline
- services for decoupling translation (archive I/O) and compression
- logic.
-
- Compression services
- --------------------
- NEURO MANAGER
- This component makes 'neuro models' of collections of files.
- These neuro models allow UC to use similarities between
-        different files to compress all files even further. The
-        analysis phase of UC refers to the creation of neuro models
-        based on the files found. The optimizing phase (which is
-        different from the optimize command) refers to the compression
-        of the neuro models. One special neuro model (the 'master') is
-        built into the compressor. This model is used to improve
-        compression of the neuro models. The master contains
-        information about the general structure of common file types
-        such as documents, spreadsheets, databases, sources, etc.
-
- ULTRA ENGINE
- The bare high speed compressor/decompressor. Based on a neuro
- model the redundancy of a stream of data is analyzed and
- compressed.
-
- Reliability services
- --------------------
- ERROR HANDLER
-        The error handler traps DOS system errors and handles them in
-        a neat and consistent way. It often allows the user to shell
-        out to DOS, solve the problem, leave DOS again and continue
-        where UC stopped. The error handler makes it impossible for
-        the user to specify 'ignore' (an unfortunate option DOS offers
-        if something goes wrong).
-
- DAMAGE PROTECTION
- This module handles damage protection, damage detection and
- damage recovery.
-
- GUARD
- The guard continuously checks the internal integrity of UC. It
- is often able to prevent damage caused by e.g. software
- conflicts or hardware problems.
-
- TEST (only available in beta versions)
- Tests and interrogates various components of UC during
- execution.
-
- Other
- -----
- LISTER
- Shows contents of archive in various formats.
-
- VIEWER
- Online help viewer.
-
-
- 7.B COMPRESSION TECHNOLOGY.
- ===========================
-
- UC significantly outperforms existing datacompression technology. For
- those interested in background information, we will discuss WHY and
- HOW UC is capable of doing this.
-
- AIP-NL would like to specifically thank the University of Utrecht, Drs.
- Ing. D. Bezemer, Prof. Dr. J. van Leeuwen, Dr. P. van Oostrum and Drs.
- Ing. N.E. de Vries for their significant participation in the design
- of the UC compression engine. The detailed design is on request
- available at the public library of the University of Utrecht.
-
- To begin, we will give a brief explanation of AR002. Most modern
- datacompressors (ARJ, LHA, ZIP, ZOO) are derivatives of AR002. AR002
- was designed by Haruhiko Okumura.
-
- AR002, a two phase compressor
- -----------------------------
- DATA <-> (#1, LZ) <-> (grouping) <-> (#2, Huffman) <-> COMPRESSED
-
- In phase #1 (the 'LempelZiv' phase) 'matches' are located. Matches
- are strings of data which are equal. The string 'hello there hello'
- can be converted to 'hello there '(-12,5) since 'hello' is matched.
- In general ARJ, LHA, ZIP and ZOO can find matches with a maximum
- distance of about 32700, and a maximum length of about 500 (rough
- estimates).
-
- In phase #2 (the 'Huffman' phase) the output of phase #1 is divided
- into blocks. For each block statistics are analyzed (for example
- the compressor notices the character 'e' is used much more often
- than 'q'). From this analysis an optimal translation table
- ('Huffman tree') is generated (e.g. 'e'=001 'q'=100100010). The
- compressed block then consists of the Huffman tree and the
- translated contents of the block.
-
- The storage of the Huffman tree, at the beginning of each block, is
- a complicated matter which we will not cover here.
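- The core of phase #2 can be illustrated with a minimal Huffman tree
- builder. This is a sketch with invented symbol counts; real
- AR002-family coders use canonical codes plus the grouping described
- below:

```python
import heapq

def huffman_codes(freq):
    """Build Huffman codes from a {symbol: count} map."""
    # Heap entries are (weight, tiebreak, tree); a tree is either a
    # symbol or a (left, right) pair. The tiebreak index keeps tuple
    # comparison away from the trees themselves.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # left edge emits a 0 bit
            walk(tree[1], prefix + "1")   # right edge emits a 1 bit
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# Toy statistics: 'e' is far more common than 'q'
codes = huffman_codes({"e": 120, "t": 90, "a": 70, "o": 60, "q": 2})
```

- A frequent symbol such as 'e' ends up with a shorter bit string than
- a rare one such as 'q', which is why the translated block is smaller
- than the original.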
-
- The number of different values phase #1 can generate is very large
- (e.g. all 256 possible values of a byte, and about 32700 * 500
-    different match descriptions). Since it is very inefficient to make
- a Huffman tree for all those possible values, the values are
- grouped, and statistics are gathered for the groups instead of the
- single values. LHA, ZOO, ARJ and ZIP all use various methods to
- realize this grouping.
-
- UltraEngine
- -----------
- The UltraEngine is also an AR002 derivative, but with some
- fundamental enhancements. These enhancements can be divided into
- two major features: the core engine and the neuro manager.
-
- The neuro manager is the most significant enhancement. Ordinary
- datacompression engines always start completely 'blank', knowing
- absolutely nothing about the data they are going to compress. The
- neuro-manager collects generic information about a group of files.
- For example a collection of *.DOC files will all have similar
- characteristics, and share common text strings like 'paper'. The
- neuro-manager collects this common 'knowledge' in a neuro model.
- This neuro model is used to make the core engine much smarter when
- similar files are compressed.
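- The effect of a neuro model is loosely comparable to priming a
- general-purpose compressor with a dictionary shared by similar files.
- The following sketch uses the preset-dictionary feature of Python's
- zlib; this is an analogy only, not UC's actual mechanism, and the
- sample strings are invented:

```python
import zlib

# Invented "shared knowledge" extracted from a group of similar files
shared = b"this paper describes the compression of paper documents "
doc = b"the compression of paper is described in this paper"

# Compress the file on its own...
plain = zlib.compress(doc, 9)

# ...and again with the shared dictionary as prior knowledge
co = zlib.compressobj(level=9, zdict=shared)
primed = co.compress(doc) + co.flush()

# Decompression needs the same dictionary back
do = zlib.decompressobj(zdict=shared)
assert do.decompress(primed) == doc
```

- The primed stream comes out smaller because long substrings of the
- file already occur in the dictionary, just as files resembling a
- neuro model compress better.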
-
- UC has a neuro model ('master') built into the executable as well.
- This master contains information about common file formats. It
- contains for example information about spreadsheets, databases,
-    wordprocessors, programming languages, common words in English,
-    French, German, Dutch, etc. This built-in information improves
-    compression.
-
- The core engine also has some enhancements over the basic AR002
- engine.
-
- The match engine has been replaced by a completely new one. This
- new engine is capable of finding matches at a distance of 64000 and
- with a maximum length of 30000.
-
- Also the way in which blocks of data are gathered for further
- Huffman compression is completely different. Instead of looking at
- one block at a time, the core engine attempts to 'learn' from
- previous blocks so less information is needed to encode statistical
- information of a block.
-
- Alternative
- -----------
-    A collection of files can also be compressed further with the
-    often used TAR+COMPRESS 'trick': first collect all files into a
-    single large file, then compress that single large file at once.
-    Especially for large collections of small files, this improves
-    compression significantly. This approach can be achieved with ARJ
-    as well, by first compressing with ARJ -m0, and compressing the
-    resulting archive again with ARJ -jm1.
-
- It should be noted, however, that the UltraEngine approach works
- much better. First of all, the compression is better. The neuro
- manager is much smarter than just a simple concatenation. This can
- be verified by compressing the output of ARJ -m0 with UC -tt. Also
- the core engine is fine tuned for optimal performance in
-    cooperation with neuro models. The 64000 byte match distance is
-    one of the essential enhancements.
-
-    Another reason why the UC approach is much better is random access.
-    Try to delete a single file from a collection of 5000 C files when
-    ARJ -m0/-jm1 has been used, or try to add or extract SOME files. UC
-    will usually do this much faster than ordinary archivers (a result
-    of better caching and a better file format), and remarkably faster
-    than a collect/compress archiver.
-
-
- 7.C DAMAGE PROTECTION TECHNOLOGY.
- =================================
-
- When an archive is 'damage protected' with the P command, it becomes
- resistant to a certain amount of damage. This paragraph explains how
- UC achieves this.
-
- We will start with a mathematical operator called 'XOR'. XOR has the
- following properties:
-
- 0 XOR 0 = 0; 0 XOR 1 = 1; 1 XOR 0 = 1; 1 XOR 1 = 0
-
- if a1 XOR a2 XOR a3 XOR a4 XOR a5 = xx
- then a1 XOR xx XOR a3 XOR a4 XOR a5 = a2
- and a1 XOR a2 XOR a3 XOR xx XOR a5 = a4
- etc.
-
- This exactly shows the essence of damage protection, a1..a5 are the
- bare data, xx is special (damage recovery) information which can be
- used to restore bare data if it somehow has been destroyed.
-
- In practice UC XORs groups of 512 bytes (sectors) instead of single
- bits. To determine whether a specific sector is intact or damaged, a
- checksum of all sectors is included in the damage recovery
- information.
-
- Damage protection adds about 1% of recovery information to an
- archive. For short files this is a single recovery sector (meaning
- up to 1 damaged sector can be recovered). For longer archives
- multiple damage recovery sectors are added, for example one for the
- odd and one for the even sectors.
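- The recovery scheme can be sketched directly from the XOR identities
- above. A toy illustration (the sector contents and the sector count
- are invented, and UC's real on-disk layout differs in detail):

```python
def parity(sectors):
    """XOR a list of equal-sized sectors into one recovery sector."""
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

# Five 512-byte "sectors" of archive data (toy values)
sectors = [bytes([n]) * 512 for n in range(1, 6)]
recovery = parity(sectors)

# Sector 1 is destroyed; a checksum (not shown) tells us which one
damaged = list(sectors)
damaged[1] = None

# XOR the recovery sector with all surviving sectors to rebuild it
rebuilt = parity([s for s in damaged if s is not None] + [recovery])
assert rebuilt == sectors[1]
```

- This is the a1..a5/xx identity from above applied to whole sectors:
- XORing the recovery sector with every intact sector yields the one
- destroyed sector.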
-
-
- 7.D BENCHMARKING.
- =================
-
- Benchmarking is a very complicated matter. It is always possible to
- create a legitimate looking benchmark which makes product A or product
- B look better. A nice example of this can be seen by just looking at
- some ads.
-
- Some products are even optimized for running benchmarks, an
- unfortunate habit which makes it even more difficult to compare
- products.
-
- We have made a successful attempt to optimize UC for daily 'real
- life' use. This means it compresses fast and decompresses very fast.
-
- But when archives are used intensively, the bare compression/
- decompression speed is only one aspect of the whole picture. What is
- also very important is how efficient updates are performed. Especially
- when 'the going gets tough', for example when a large archive resides
- on a floppy disk, or when the archive is on a network, UC can be much
- faster than other products.
-
- An unfortunate problem in benchmarking archivers is the 'Calgary
- Compression Corpus', which is often used to compare datacompressors.
- Well, UC will do noticeably better than other compressors with this
- test, yet the real power of the compression engine is not used in this
- unrealistic test. When collections of 'real life' files (e.g. a large
- collection of documents, or sources, or the average archive one finds
- on a BBS) are compressed, the difference between UC and other
- products becomes much more significant.
-
- Of course, although performance is very important, it is only one
- aspect in the usability of a product.
-
- In general it is our opinion that you, the user, are the best person
- to perform benchmarks. Try the product the way you would actually USE
- it and see for yourself.
-